WikiLeaks Document Release 

http://wikileaks.org/wiki/CRS-RL31798 
February 2, 2009 

Congressional Research Service 
Report RL31798 

Data Mining and Homeland Security: An Overview 

Jeffrey W. Seifert, Resources, Science, and Industry Division 
August 27, 2008 

Abstract. Data mining has become increasingly common in both the public and private sectors. Organizations 
use data mining as a tool to survey customer information, reduce fraud and waste, and assist in medical 
research. However, the proliferation of data mining has raised some implementation and oversight issues as 
well. These include concerns about the quality of the data being analyzed, the interoperability of the databases 
and software between agencies, and potential infringements on privacy. Also, there are some concerns that the 
limitations of data mining are being overlooked as agencies work to emphasize their homeland security initiatives. 






Order Code RL31798 



CRS Report for Congress 



Data Mining and Homeland Security: 

An Overview 



Updated August 27, 2008 



Jeffrey W. Seifert 
Specialist in Information Policy and Technology 
Resources, Science, and Industry Division 











Summary 

Data mining has become one of the key features of many homeland security 
initiatives. Often used as a means of detecting fraud, assessing risk, and retailing
products, data mining involves the use of data analysis tools to discover previously 
unknown, valid patterns and relationships in large data sets. In the context of 
homeland security, data mining can be a potential means to identify terrorist 
activities, such as money transfers and communications, and to identify and track 
individual terrorists themselves, such as through travel and immigration records. 

While data mining represents a significant advance in the type of analytical tools 
currently available, there are limitations to its capability. One limitation is that 
although data mining can help reveal patterns and relationships, it does not tell the 
user the value or significance of these patterns. These types of determinations must 
be made by the user. A second limitation is that while data mining can identify 
connections between behaviors and/or variables, it does not necessarily identify a 
causal relationship. Successful data mining still requires skilled technical and 
analytical specialists who can structure the analysis and interpret the output. 

Data mining is becoming increasingly common in both the private and public 
sectors. Industries such as banking, insurance, medicine, and retailing commonly use 
data mining to reduce costs, enhance research, and increase sales. In the public 
sector, data mining applications initially were used as a means to detect fraud and 
waste, but have grown to also be used for purposes such as measuring and improving 
program performance. However, some of the homeland security data mining 
applications represent a significant expansion in the quantity and scope of data to be 
analyzed. Some efforts that have attracted a higher level of congressional interest 
include the Terrorism Information Awareness (TIA) project (now-discontinued) and 
the Computer-Assisted Passenger Prescreening System II (CAPPS II) project 
(now-canceled and replaced by Secure Flight). Other initiatives that have been the subject 
of congressional interest include the Multi-State Anti-Terrorism Information 
Exchange (MATRIX), the Able Danger program, the Automated Targeting System 
(ATS), and data collection and analysis projects being conducted by the National 
Security Agency (NSA). 

As with other aspects of data mining, while technological capabilities are 
important, there are other implementation and oversight issues that can influence the 
success of a project’s outcome. One issue is data quality, which refers to the 
accuracy and completeness of the data being analyzed. A second issue is the 
interoperability of the data mining software and databases being used by different 
agencies. A third issue is mission creep, or the use of data for purposes other than 
those for which the data were originally collected. A fourth issue is privacy. Questions 
that may be considered include the degree to which government agencies should use 
and mix commercial data with government data, whether data sources are being used 
for purposes other than those for which they were originally designed, and possible 
application of the Privacy Act to these initiatives. It is anticipated that congressional 
oversight of data mining projects will grow as data mining efforts continue to evolve. 
This report will be updated as events warrant. 
Data Mining and Homeland Security: 
An Overview 



What Is Data Mining? 

Data mining involves the use of sophisticated data analysis tools to discover 
previously unknown, valid patterns and relationships in large data sets. 1 These tools 
can include statistical models, mathematical algorithms, and machine learning 
methods (algorithms that improve their performance automatically through 
experience, such as neural networks or decision trees). Consequently, data mining 
consists of more than collecting and managing data; it also includes analysis and 
prediction. 

Data mining can be performed on data represented in quantitative, textual, or 
multimedia forms. Data mining applications can use a variety of parameters to 
examine the data. They include association (patterns where one event is connected 
to another event, such as purchasing a pen and purchasing paper), sequence or path 
analysis (patterns where one event leads to another event, such as the birth of a child 
and purchasing diapers), classification (identification of new patterns, such as 
coincidences between duct tape purchases and plastic sheeting purchases), clustering 
(finding and visually documenting groups of previously unknown facts, such as 
geographic location and brand preferences), and forecasting (discovering patterns 
from which one can make reasonable predictions regarding future activities, such as 
the prediction that people who join an athletic club may take exercise classes). 2 
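
The association parameter described above can be illustrated with a minimal sketch. The purchase records, the thresholds, and the function name below are hypothetical and chosen only for illustration; the sketch counts how often pairs of items appear together (support) and how often one item accompanies another (confidence), which is one common way, though not the only way, to measure an association such as the pen and paper example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records; each set is one customer's basket.
baskets = [
    {"pen", "paper"},
    {"pen", "paper", "stapler"},
    {"pen", "ink"},
    {"paper", "stapler"},
    {"pen", "paper", "ink"},
]


def association_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all baskets holding both items
        if support < min_support:
            continue
        # Check the rule in both directions: a -> b and b -> a.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


rules = association_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data set, the pen and paper pairing is reported in both directions because three of the four baskets containing either item contain the other as well.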

1 Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third 
Edition (Potomac, MD: Two Crows Corporation, 1999); Pieter Adriaans and Dolf Zantinge, 
Data Mining (New York: Addison Wesley, 1996). 

2 For a more technically-oriented definition of data mining, see 
[http://searchcrm.techtarget.com/gDefinition/0,294236,sid11_gci211901,00.html]. 

Compared to other data analysis applications, such as structured queries (used 
in many commercial databases) or statistical analysis software, data mining 
represents a difference of kind rather than degree. Many simpler analytical tools 
utilize a verification-based approach, where the user develops a hypothesis and then 
tests the data to prove or disprove the hypothesis. For example, a user might 
hypothesize that a customer who buys a hammer will also buy a box of nails. The 
effectiveness of this approach can be limited by the user’s creativity in developing 
hypotheses, as well as by the structure of the software being used. In contrast, data 
mining utilizes a discovery approach, in which algorithms can be used to examine 
several multidimensional data relationships simultaneously, identifying those that 
are unique or frequently represented. For example, a hardware store may compare its 
customers’ tool purchases with home ownership, type of automobile driven, age, 
occupation, income, and/or distance between residence and the store. As a result of 
its complex capabilities, two precursors are important for a successful data mining 
exercise: a clear formulation of the problem to be solved, and access to the relevant 
data. 3 
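
The contrast between the verification and discovery approaches can be sketched in a few lines. The customer records, attribute names, and the choice of lift as the scoring measure are assumptions made for illustration; they are not drawn from any particular data mining product. The first function tests a single hypothesis supplied by the user, while the second scans every pair of attribute values without being told what to look for.

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer records (attribute names and values are illustrative).
customers = [
    {"bought": "hammer", "homeowner": True, "vehicle": "truck"},
    {"bought": "hammer", "homeowner": True, "vehicle": "truck"},
    {"bought": "drill", "homeowner": True, "vehicle": "sedan"},
    {"bought": "hammer", "homeowner": False, "vehicle": "sedan"},
    {"bought": "drill", "homeowner": False, "vehicle": "sedan"},
    {"bought": "drill", "homeowner": True, "vehicle": "truck"},
]


def verify(records, condition, outcome):
    """Verification: test one user-supplied hypothesis against the data."""
    matching = [r for r in records
                if all(r[k] == v for k, v in condition.items())]
    if not matching:
        return 0.0
    hits = sum(all(r[k] == v for k, v in outcome.items()) for r in matching)
    return hits / len(matching)


def discover(records, min_lift=1.2):
    """Discovery: score every pair of attribute values; no hypothesis given."""
    n = len(records)
    singles = Counter(item for r in records for item in r.items())
    pairs = Counter(
        pair for r in records for pair in combinations(sorted(r.items()), 2)
    )
    found = []
    for (x, y), count in pairs.items():
        # Lift above 1 means the two values co-occur more often than
        # they would if they were independent of each other.
        lift = count * n / (singles[x] * singles[y])
        if lift >= min_lift:
            found.append((x, y, round(lift, 2)))
    return sorted(found, key=lambda t: -t[2])


# Verification: "customers who buy a hammer are homeowners."
share = verify(customers, {"bought": "hammer"}, {"homeowner": True})
print(f"hammer buyers who are homeowners: {share:.2f}")

# Discovery: list whatever relationships stand out in the data.
patterns = discover(customers)
for x, y, lift in patterns:
    print(x, y, lift)
```

The discovery pass surfaces relationships the user never asked about, which is why a clear problem formulation still matters: someone must decide which of the reported patterns are meaningful.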

Reflecting this conceptualization of data mining, some observers consider data 
mining to be just one step in a larger process known as knowledge discovery in 
databases (KDD). In progressive order, the steps that precede data mining in the KDD 
process are data cleaning, data integration, data selection, and data transformation; 
the steps that follow it are pattern evaluation and knowledge presentation. 4 
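
The ordering of the KDD steps can be sketched as a chain of small functions, one per step. The two data sources, the field names, and the minimum-count rule below are hypothetical; a real project would use far more elaborate logic at each stage, but the sequence is the one listed above.

```python
from collections import Counter

# Two hypothetical sources keyed by customer id; values are illustrative.
purchases = [
    {"id": 1, "item": "hammer"},
    {"id": 2, "item": "hammer"},
    {"id": 3, "item": "drill"},
    {"id": 4, "item": None},
    {"id": 5, "item": "hammer"},
]
profiles = {
    1: {"homeowner": True},
    2: {"homeowner": True},
    3: {"homeowner": False},
    5: {"homeowner": True},
}


def clean(rows):
    # Data cleaning: drop records with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(rows, lookup):
    # Data integration: merge the two sources on the shared id.
    return [{**r, **lookup[r["id"]]} for r in rows if r["id"] in lookup]


def select(rows, fields):
    # Data selection: keep only the fields relevant to the question.
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows):
    # Data transformation: encode each record as a hashable tuple.
    return [(r["item"], r["homeowner"]) for r in rows]


def mine(encoded):
    # Data mining: count how often each combination occurs.
    return Counter(encoded)


def evaluate(patterns, min_count=2):
    # Pattern evaluation: keep patterns that meet a minimum count.
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    # Knowledge presentation: render the surviving patterns as text.
    return [f"item={item}, homeowner={owner}: {count} customers"
            for (item, owner), count in sorted(patterns.items())]


rows = clean(purchases)
rows = integrate(rows, profiles)
rows = select(rows, ["item", "homeowner"])
report = present(evaluate(mine(transform(rows))))
print("\n".join(report))
```

Only one of the seven functions performs the mining itself, which reflects the view that data mining is a single step in a longer process.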

A number of advances in technology and business processes have contributed 
to a growing interest in data mining in both the public and private sectors. Some of 
these changes include the growth of computer networks, which can be used to 
connect databases; the development of enhanced search-related techniques such as 
neural networks and advanced algorithms; the spread of the client/server computing 
model, allowing users to access centralized data resources from the desktop; and an 
increased ability to combine data from disparate sources into a single searchable 
source. 5 

In addition to these improved data management tools, the increased availability 
of information and the decreasing costs of storing it have also played a role. Over the 
past several years there has been a rapid increase in the volume of information 
collected and stored, with some observers suggesting that the quantity of the world’s 
data approximately doubles every year. 6 At the same time, the costs of data storage 
have decreased significantly, from dollars per megabyte to pennies per megabyte. 
Similarly, computing power has continued to double every 18-24 months, while the 
relative cost of computing power has continued to decrease. 7 
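
The doubling rates cited above compound quickly. As a rough worked example, the sketch below computes the growth factor implied by a given doubling period; the ten-year horizon and the function name are arbitrary choices made for illustration, and the rates themselves are the observers' estimates rather than measured values.

```python
def growth_factor(years, doubling_months):
    """Growth multiple after `years` if a quantity doubles every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)


HORIZON_YEARS = 10  # arbitrary horizon chosen for illustration

# Data volume doubling every 12 months: 2**10 = 1024.
data_growth = growth_factor(HORIZON_YEARS, 12)

# Computing power doubling every 18 months: 2**(120/18), about 101.6.
compute_fast = growth_factor(HORIZON_YEARS, 18)

# Computing power doubling every 24 months: 2**5 = 32.
compute_slow = growth_factor(HORIZON_YEARS, 24)

print(f"data volume:     x{data_growth:,.0f}")
print(f"computing power: x{compute_slow:,.0f} to x{compute_fast:,.0f}")
```

If both estimates held over the same period, the volume of stored data would grow faster than the computing power available to analyze it.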

Data mining has become increasingly common in both the public and private 
sectors. Organizations use data mining as a tool to survey customer information, 
reduce fraud and waste, and assist in medical research. However, the proliferation 
of data mining has raised some implementation and oversight issues as well. These 
include concerns about the quality of the data being analyzed, the interoperability of 
the databases and software between agencies, and potential infringements on privacy. 
Also, there are some concerns that the limitations of data mining are being 
overlooked as agencies work to emphasize their homeland security initiatives. 



3 John Makulowich, “Government Data Mining Systems Defy Definition,” Washington 
Technology, 22 February 1999, 
[http://www.washingtontechnology.com/news/13_22/tech_features/393-3.html]. 

4 Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (New York: 
Morgan Kaufmann Publishers, 2001), p. 7. 

5 Pieter Adriaans and Dolf Zantinge, Data Mining (New York: Addison Wesley, 1996), pp. 
5-6. 

6 Ibid., p. 2. 

7 Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third 
Edition (Potomac, MD: Two Crows Corporation, 1999), p. 4. 




